3 Statistical Analysis
In this lab, we will formally answer the following aims in our statistical report.
- Is there a difference between the mean yields of potatoes from fields that were fertilised and fields that were unfertilised?
- Is there a difference between the mean yields of wheat from fields that were fertilised and fields that were unfertilised?
In order to do this, we need to learn some new methods of calculating confidence intervals which will allow us to calculate a 95% confidence interval for the difference between the mean yields in each case.
3.1 Difference in Population Means
In general, when we have two independent random variables which both follow normal distributions, say \(X\sim N(\mu_X,\,\sigma_X)\) and \(Y\sim N(\mu_Y,\,\sigma_Y)\), it is often of interest to find a range of values for the difference between the population means, \(\mu_X\) and \(\mu_Y\). This range of values can be found using a confidence interval for \(\mu_X-\mu_Y\), but we have to be aware of the variances of both distributions, as this can change the way the confidence interval is calculated. In order to construct these confidence intervals, random samples have to be drawn from the underlying distributions.
We will use the following notation in the formulae for confidence intervals for the difference in population means:
- \(n_X\): the size of the sample drawn from \(X\sim N(\mu_X,\,\sigma_X)\).
- \(n_Y\): the size of the sample drawn from \(Y\sim N(\mu_Y,\,\sigma_Y)\).
- \(x_1, x_2, \ldots , x_{n_X}\): the sample taken from \(X\sim N(\mu_X,\,\sigma_X)\) with sample mean \(\bar{x}\) and sample variance \(s^2_X\).
- \(y_1, y_2, \ldots , y_{n_Y}\): the sample taken from \(Y\sim N(\mu_Y,\,\sigma_Y)\) with sample mean \(\bar{y}\) and sample variance \(s^2_Y\).
3.1.1 Variances are known and equal (8.2.3)
In the case where the population variances of both distributions are known and are equal, that is \(\sigma_X^2=\sigma_Y^2=\sigma^2\), then a \((1-\alpha)\cdot100\%\) confidence interval for the difference in the population means, \(\mu_X-\mu_Y\), is given by,
\begin{equation}
\left(\bar{x}-\bar{y}\right)\pm z_{1-\frac{\alpha}{2}}\,\sigma\sqrt{\frac{1}{n_X}+\frac{1}{n_Y}}
\label{eq:known-equal}
\end{equation}
You can find the derivation of this result and some additional exercises in Section 8.2.3 of Probability and Statistics with R.
We saw from the Exploratory Analysis completed in Lab 8 that the yields of potatoes from fertilised and unfertilised fields both seem to follow a normal distribution with approximately equal variances. To simplify our analysis at this stage, let’s first assume that the population variances are known and are equal to \(\sigma^2_\mbox{f}=\sigma^2_\mbox{uf}=4^2\).
That is, we are assuming that the yield of potatoes from fertilised fields follows a \(N(\mu_\mbox{f},4^2)\) distribution, and the yield of potatoes from unfertilised fields follows a \(N(\mu_\mbox{uf}, 4^2)\) distribution.
Now we are ready to compute a 95% confidence interval for the difference in population mean yields of potatoes from fields that were fertilised and fields that were unfertilised using Equation \(\eqref{eq:known-equal}\). That is, a confidence interval for \(\mu_\mbox{f}-\mu_\mbox{uf}\). There are a few different ways we could do this in R.
Method 1: Using R as a calculator
In order to use Equation \(\eqref{eq:known-equal}\) above, there are a few values we need to find. We have already assumed that \(\sigma=4\), and we know that \(n_X=130\) and \(n_Y=80\). We can save these values in R using the code below.
sigma <- 4
nx <- 130
ny <- 80
That leaves us to find \(\bar x\), \(\bar y\) and \(z_{1-\frac{\alpha}{2}}\), the \((1-\frac{\alpha}{2})\)th quantile from the standard normal distribution. In our case, because we are interested in a 95% confidence interval, \(\alpha=0.05\), so we are looking for the \(\left(1-\frac{0.05}{2}\right)=0.975\)th quantile.
We can find these three values in R using the code below.
xbar <- mean(potato_fertilised$yield)
ybar <- mean(potato_unfertilised$yield)
z <- qnorm(0.975)
Now we can see in the Environment tab that \(\bar x=38.75\) tonnes, \(\bar y=38.6\) tonnes and \(z_{0.975}=1.96\).
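As a quick check of how the quantile depends on the confidence level, the sketch below evaluates qnorm() at \(1-\frac{\alpha}{2}\) for three common confidence levels. The object names conf_levels, alpha_levels and z_values are illustrative only and are not used elsewhere in this lab.

```r
# Confidence levels to compare (illustrative choices)
conf_levels <- c(0.90, 0.95, 0.99)

# Corresponding significance levels
alpha_levels <- 1 - conf_levels

# Two-sided critical values: the (1 - alpha/2)th standard normal quantiles
z_values <- qnorm(1 - alpha_levels/2)

round(z_values, 3)
```

The three values are approximately 1.645, 1.960 and 2.576, so a higher confidence level gives a larger quantile and therefore a wider interval.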
Calculating the 95% confidence interval is now just a case of subbing in all the values into Equation \(\eqref{eq:known-equal}\). We can do this in R as follows.
(xbar - ybar) + c(-1, 1)*z*sigma*sqrt(1/nx + 1/ny)
[1] -0.9603676 1.2677138
Using c(-1, 1) allows us to find the lower and upper limit of the confidence interval in one easy step. We would interpret this confidence interval by saying that we are 95% confident that the difference between the population mean yields of potatoes from fertilised and unfertilised fields lies in the interval \(\left[-0.9604,\, 1.2677\right]\).
Note that this interval is not strictly positive i.e. it contains the value 0. This tells us that there is not sufficient evidence that the population mean potato yield from fertilised fields is any different from the population mean potato yield from unfertilised fields at the \(\alpha=0.05\) significance level.
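The link between "the interval contains 0" and "no significant difference at \(\alpha=0.05\)" can be sketched with a two-sided z-test. The code below is a minimal illustration that uses the rounded means quoted in the text (38.75 and 38.6) rather than the lab data, so its interval differs slightly from the one above in the later decimal places. All object names ending in _r are hypothetical and chosen so that they do not overwrite the objects saved earlier.

```r
# Summary values quoted in the text (means rounded to 2 d.p.)
xbar_r <- 38.75
ybar_r <- 38.6
sigma_r <- 4
nx_r <- 130
ny_r <- 80

# Standard error of the difference in sample means
se_r <- sigma_r*sqrt(1/nx_r + 1/ny_r)

# 95% confidence interval, computed as in Method 1
ci_r <- (xbar_r - ybar_r) + c(-1, 1)*qnorm(0.975)*se_r

# Two-sided z-test of the hypothesis that mu_X - mu_Y = 0
z_stat_r <- (xbar_r - ybar_r)/se_r
p_value_r <- 2*pnorm(-abs(z_stat_r))

# The interval contains 0 exactly when the p-value exceeds 0.05
contains_zero <- ci_r[1] <= 0 && 0 <= ci_r[2]
c(contains_zero = contains_zero, p_above_alpha = p_value_r > 0.05)
```

Both entries are TRUE here: the interval contains 0 and the p-value is above 0.05, which is the same conclusion reached from two directions.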
Method 2: Using the z.test() function
Exactly the same result can be reached by using the function z.test(). This function runs through all the steps we manually calculated in the section above, so can speed up the process of finding a confidence interval. This function is part of the PASWR2 package, so we need to load it into the R session. z.test() can take the following arguments,
- x =: this is a vector containing the sample data taken from the first normal distribution.
- y =: this is a vector containing the sample data taken from the second normal distribution.
- sigma.x =: this is the standard deviation of the first normal distribution.
- sigma.y =: this is the standard deviation of the second normal distribution.
- conf.level =: this is the confidence level for the desired interval. It needs to be put in as a decimal, so for a 95% confidence interval, we would use 0.95, for example.
The code below uses z.test() to find the 95% confidence interval for the difference in population mean potato yield from fertilised and unfertilised fields. We use $conf at the end of the function to extract only the confidence interval. z.test() by itself will return additional results.
library(PASWR2)
z.test(x = potato_fertilised$yield, y = potato_unfertilised$yield,
       sigma.x = 4, sigma.y = 4, conf.level = 0.95)$conf
[1] -0.9603676 1.2677138
attr(,"conf.level")
[1] 0.95
Again, we would interpret this confidence interval by saying that we are 95% confident that the difference between the population mean yields of potatoes from fertilised and unfertilised fields lies in the interval \(\left[-0.9604,\, 1.2677\right]\).
We would again say that there is not sufficient evidence that the population mean potato yield from fertilised fields is any different from the population mean potato yield from unfertilised fields since this interval contains the value 0.
Method 3: Using the zsum.test() function
Another function contained in the PASWR2 package that can be useful is the zsum.test() function. It calculates the same confidence interval, but rather than inputting the sample data itself, you input summary statistics. This makes zsum.test() a good choice when a question gives you already calculated summary statistics rather than the data from the samples.
The arguments that zsum.test() can take are,
- mean.x =: this is the mean of the first sample. We have already calculated \(\bar x\) and saved it as xbar.
- mean.y =: this is the mean of the second sample. We have already calculated \(\bar y\) and saved it as ybar.
- sigma.x =: this is the standard deviation of the underlying normal distribution the first sample has been taken from. We know from the question that this is equal to 4 and we have saved it in the object sigma.
- sigma.y =: this is the standard deviation of the underlying normal distribution the second sample has been taken from. We know from the question that this is equal to 4 and we have saved it in the object sigma.
- n.x =: this is the size of the first sample. In the question this is \(n_X=130\) and this is already saved as nx.
- n.y =: this is the size of the second sample. In the question this is \(n_Y=80\) and this is already saved as ny.
- conf.level =: this is the confidence level for the desired interval. It needs to be put in as a decimal, so for a 95% confidence interval, we would use 0.95, for example.
We can input all of these values as arguments in the zsum.test() function to find the 95% confidence interval for the difference in population mean yield of potatoes from the fertilised and unfertilised fields. Again, we use $conf at the end of the function to extract only the confidence interval.
library(PASWR2) #this line only needs to be run once in each R session
zsum.test(mean.x = xbar, mean.y = ybar, sigma.x = sigma, sigma.y = sigma,
          n.x = nx, n.y = ny, conf.level = 0.95)$conf
[1] -0.9603676 1.2677138
attr(,"conf.level")
[1] 0.95
This confidence interval tells us that we are 95% confident that the difference between the population mean yields of potatoes from fertilised and unfertilised fields lies in the interval \(\left[-0.9604,\, 1.2677\right]\). Once again, we see that this interval contains 0, so there is not sufficient evidence to suggest a difference in population mean potato yield exists between the fertilised and unfertilised fields when the significance level is \(\alpha=0.05\).
3.1.2 Variances are known and unequal (8.2.4)
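In the case where the variances of both distributions are known but are unequal, that is \(\sigma_X^2\neq\sigma_Y^2\), then a \((1-\alpha)\cdot100\%\) confidence interval for the difference in the population means, \(\mu_X-\mu_Y\), is given by,
\begin{equation}
\left(\bar{x}-\bar{y}\right)\pm z_{1-\frac{\alpha}{2}}\sqrt{\frac{\sigma_X^2}{n_X}+\frac{\sigma_Y^2}{n_Y}}
\label{eq:known-unequal}
\end{equation}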
To see some further examples of using this confidence interval, see Section 8.2.4 of Probability and Statistics with R.
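When the two known standard deviations happen to be equal, the interval for unequal variances coincides with the one from Section 3.1.1, since \(\sqrt{\sigma^2/n_X+\sigma^2/n_Y}=\sigma\sqrt{1/n_X+1/n_Y}\). The sketch below checks this numerically. ci_known() is a hypothetical helper written for this illustration (it is not a PASWR2 function), and the inputs are the rounded potato summaries from Section 3.1.1.

```r
# Hypothetical helper (not part of PASWR2): confidence interval for
# mu_X - mu_Y from summary statistics when both sigmas are known
ci_known <- function(xbar, ybar, sigma_x, sigma_y, n_x, n_y, conf_level = 0.95) {
  alpha <- 1 - conf_level
  z <- qnorm(1 - alpha/2)
  (xbar - ybar) + c(-1, 1)*z*sqrt(sigma_x^2/n_x + sigma_y^2/n_y)
}

# Equal-variance formula from Section 3.1.1, with rounded potato summaries
equal_formula <- (38.75 - 38.6) + c(-1, 1)*qnorm(0.975)*4*sqrt(1/130 + 1/80)

# General formula with sigma_x = sigma_y = 4
general_formula <- ci_known(38.75, 38.6, sigma_x = 4, sigma_y = 4,
                            n_x = 130, n_y = 80)

all.equal(equal_formula, general_formula)
```

Because the two formulas agree whenever the standard deviations match, the same helper could be used for either of the known-variance cases.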
From the exploratory analysis completed in Lab 8, the yields of wheat from fertilised and unfertilised fields both approximately follow normal distributions, but with unequal variances. At this stage of the analysis, we’re going to assume the population variances are known and are \(\sigma_{\mbox{f}}^2=3^2\) for yields from fertilised fields, and \(\sigma_{\mbox{uf}}^2=1^2\) for yields from unfertilised fields.
That is, we’re assuming the wheat yields from fertilised fields follow a \(N(\mu_{\mbox{f}},3^2)\) distribution and the wheat yields from unfertilised fields follow a \(N(\mu_{\mbox{uf}}, 1^2)\) distribution.
We can then calculate a 95% confidence interval for the difference in population mean wheat yields from fertilised and unfertilised fields using Equation \(\eqref{eq:known-unequal}\) above. This can be done, again, in a few different ways in R.
Method 1: Using R as a calculator
In order to use Equation \(\eqref{eq:known-unequal}\), we can start by saving all the values that we know from the question as objects in the Environment tab. That is, \(n_X=90\), \(n_Y=120\), \(\sigma_X=3\) and \(\sigma_Y=1\).
nx <- 90
ny <- 120
sigmax <- 3
sigmay <- 1
In order to use the formula for the confidence interval stated above, we still need to find \(\bar x\), \(\bar y\) and \(z_{1-\frac{\alpha}{2}}\). We can find each of these using the code below.
xbar <- mean(wheat_fertilised$yield)
ybar <- mean(wheat_unfertilised$yield)
z <- qnorm(0.975)
Now we can see in the Environment tab that \(\bar x=23.8\) tonnes, \(\bar y=17.98\) tonnes and \(z_{0.975}=1.96\).
Calculating the 95% confidence interval is now just a case of subbing in all the values into Equation \(\eqref{eq:known-unequal}\). We can do this in R as follows.
(xbar - ybar) + c(-1, 1)*z*sqrt(sigmax^2/nx + sigmay^2/ny)
[1] 5.174600 6.464806
We would interpret this confidence interval by saying that we are 95% confident that the difference between the population mean yields of wheat from fertilised and unfertilised fields lies in the interval \(\left[5.1746,\, 6.4648\right]\).
Note that this interval is strictly positive which means there is statistically significant evidence of a difference between the population mean yields of wheat from the two types of field at the 5% significance level.
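To illustrate what "95% confident" means, the sketch below repeatedly simulates pairs of samples with the same sizes and known standard deviations as the wheat example, and records how often the interval contains the true difference in means. The true means (24 and 18), the seed and the number of repetitions are arbitrary assumptions made for the simulation. They are not estimates from the lab data.

```r
set.seed(2024)  # arbitrary seed so the simulation is reproducible

# Assumed 'true' population values, for the simulation only
mu_x <- 24
mu_y <- 18
sd_x <- 3
sd_y <- 1
n_sim_x <- 90
n_sim_y <- 120
n_reps <- 2000

# Simulate one pair of samples per repetition and check whether the
# known-variance interval covers the true difference mu_x - mu_y
covers <- replicate(n_reps, {
  x <- rnorm(n_sim_x, mean = mu_x, sd = sd_x)
  y <- rnorm(n_sim_y, mean = mu_y, sd = sd_y)
  half_width <- qnorm(0.975)*sqrt(sd_x^2/n_sim_x + sd_y^2/n_sim_y)
  ci <- (mean(x) - mean(y)) + c(-1, 1)*half_width
  ci[1] <= (mu_x - mu_y) && (mu_x - mu_y) <= ci[2]
})

# Proportion of simulated intervals that contain the true difference
mean(covers)
```

The proportion should be close to 0.95. In other words, the 95% refers to how often the procedure captures the true difference over many repeated samples.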
Method 2: Using the z.test() function
As we have seen, the function z.test() from the PASWR2 package is useful if we have been provided the data and know the variances of the two normal distributions that the groups follow.
In this case, we can use z.test() to calculate a 95% confidence interval for the difference in population mean wheat yields from fertilised and unfertilised fields as follows.
library(PASWR2) #this line only needs to be run once in each R session
z.test(x = wheat_fertilised$yield, y = wheat_unfertilised$yield,
       sigma.x = 3, sigma.y = 1, conf.level = 0.95)$conf
[1] 5.174600 6.464806
attr(,"conf.level")
[1] 0.95
Again, we would interpret this confidence interval by saying that we are 95% confident that the difference between the population mean yields of wheat from fertilised and unfertilised fields lies in the interval \(\left[5.1746,\, 6.4648\right]\). We also see statistically significant evidence that there is a difference in these two population means at the 5% significance level.
Method 3: Using the zsum.test() function
To save us from having to use R as a calculator to find the confidence interval, we can use the function zsum.test() from the PASWR2 package.
We already have saved in the Environment tab that,
- xbar \(=\bar x=23.8\)
- ybar \(=\bar y=17.98\)
- sigmax \(=\sigma_X=3\)
- sigmay \(=\sigma_Y=1\)
- nx \(=n_X=90\)
- ny \(=n_Y=120\)
We can then find the 95% confidence interval for the difference in population mean wheat yield from fertilised and unfertilised fields as follows.
library(PASWR2) #this line only needs to be run once in each R session
zsum.test(mean.x = xbar, mean.y = ybar, sigma.x = sigmax, sigma.y = sigmay,
          n.x = nx, n.y = ny, conf.level = 0.95)$conf
[1] 5.174600 6.464806
attr(,"conf.level")
[1] 0.95
This function is mostly useful if we are not provided with the original sample data, but rather summary statistics.
We can again see that the confidence interval is \(\left[5.1746,\, 6.4648\right]\), so we say we are 95% confident that the difference between the population mean yields of wheat from fertilised and unfertilised fields lies in this interval and that there is statistically significant evidence of a difference in these two population means at the 5% significance level.
3.1.3 Variances are unknown and assumed equal (8.2.5)
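When random samples have been taken from two normal distributions where the variances are unknown but assumed to be equal, a \((1-\alpha)\cdot100\%\) confidence interval for \(\mu_X-\mu_Y\) is given by,
\begin{equation}
\left(\bar{x}-\bar{y}\right)\pm t_{1-\frac{\alpha}{2};\nu_p}\,s_p\sqrt{\frac{1}{n_X}+\frac{1}{n_Y}}
\label{eq:unknown-equal}
\end{equation}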
\(\nu_p\) represents the degrees of freedom for the associated \(t\) distribution. The degrees of freedom can be found as \(\nu_p=n_X+n_Y-2\).
\(s_p\) is a pooled estimate of the standard deviation that takes into account the sample sizes, \(n_X\) and \(n_Y\), taken from each distribution. An estimate for the pooled variance can be found using, \[s_p^2=\frac{\left(n_X-1\right)s_X^2+\left(n_Y-1\right)s_Y^2}{n_X+n_Y-2}\]
where, \(s_X^2=\frac{\sum_{i=1}^{n_X}x_i^2-n_X\bar x^2}{n_X-1}\) and \(s_Y^2=\frac{\sum_{i=1}^{n_Y}y_i^2-n_Y\bar y^2}{n_Y-1}\).
Remember to take the square root of the estimated variance to find the estimate of standard deviation.
To see more examples of calculating these confidence intervals, see Section 8.2.5 of Probability and Statistics with R.
So far, when calculating the confidence interval for the difference in population mean yield of potatoes, we have been told the value for \(\sigma^2_X=\sigma^2_Y=\sigma^2\), the population variance. It might not be the case that we know this value, in which case we would use Equation \(\eqref{eq:unknown-equal}\) stated above.
This interval differs from Equation \(\eqref{eq:known-equal}\) in Section 3.1.1 in two ways: a quantile from the \(t\)-distribution is used rather than one from the standard normal, and a pooled estimate of the standard deviation, \(s_p\), replaces \(\sigma\). Both changes are needed because we do not know a value for the population standard deviation \(\sigma\).
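To see how much the switch from the standard normal to the \(t\)-distribution matters for samples of this size, the following sketch compares the two 0.975th quantiles. The degrees of freedom use the potato sample sizes (130 and 80). The object names are illustrative only.

```r
# Degrees of freedom for the potato samples: n_X + n_Y - 2
df_pooled <- 130 + 80 - 2

# 0.975th quantiles of the t-distribution and the standard normal
t_quantile <- qt(0.975, df_pooled)
z_quantile <- qnorm(0.975)

c(t = t_quantile, z = z_quantile)

# For comparison, the t quantile at smaller degrees of freedom
qt(0.975, c(5, 30, df_pooled))
```

With 208 degrees of freedom the \(t\) quantile is only slightly larger than the normal one, so the resulting interval is only slightly wider. The gap grows as the degrees of freedom shrink.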
We don’t actually know the standard deviations of the distributions that potato yields follow, but based on our exploratory analysis from Lab 8, we can assume that they are equal. We also know that potato yields from both fertilised and unfertilised fields approximately follow normal distributions. Therefore, a 95% confidence interval for the difference in mean population potato yields from fertilised and unfertilised fields can be calculated using Equation \(\eqref{eq:unknown-equal}\). The confidence interval can be found in R in a few different ways.
Method 1: Using R as a calculator
We can start by saving \(n_X=130\) and \(n_Y=80\) as objects in the Environment tab.
nx <- 130
ny <- 80
Then we can calculate \(\bar x\), \(\bar y\) and \(t_{1-\frac{\alpha}{2};\nu_p}\), which is the \((1-\frac{\alpha}{2})\)th quantile from the \(t\)-distribution with \(\nu_p=n_X+n_Y-2\) degrees of freedom. Because we are interested in a 95% confidence interval, \(\alpha=0.05\), so the quantile we are looking for is the \(\left(1-\frac{0.05}{2}\right)=0.975\)th quantile. To find a quantile from the \(t\)-distribution, we use the qt() function.
xbar <- mean(potato_fertilised$yield)
ybar <- mean(potato_unfertilised$yield)
t <- qt(0.975, nx + ny - 2)
In order to use Equation \(\eqref{eq:unknown-equal}\), we also need to calculate the estimate for the pooled variance, \(s_p^2\). Before we can do this, we need to find the sample variances for both groups, \(s_X^2\) and \(s_Y^2\). This can be done using the code below, which first of all calculates \(\sum_{i=1}^{n_X}x_i^2\) and \(\sum_{i=1}^{n_Y}y_i^2\) and saves them as sum_x2 and sum_y2 respectively. These values are then used to find \(s_X^2\) and \(s_Y^2\), which have been saved as s2x and s2y respectively.
sum_x2 <- sum(potato_fertilised$yield^2)
sum_y2 <- sum(potato_unfertilised$yield^2)
s2x <- (sum_x2 - nx*xbar^2)/(nx - 1)
s2y <- (sum_y2 - ny*ybar^2)/(ny - 1)
Note that here we have calculated the variance of both groups ‘by hand’. We could do exactly the same thing using the var() function, which calculates the sample variance of a vector of values.
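The agreement between the ‘by hand’ calculation and var() can be checked on any numeric vector. The sketch below does so on simulated values that stand in for a yield sample. The seed, the simulated mean and standard deviation, and the object names are assumptions made for the illustration, not the lab data.

```r
set.seed(1)  # arbitrary seed

# Simulated stand-in for a yield sample (not the lab data)
yield_sim <- rnorm(80, mean = 38, sd = 4)
n_sim <- length(yield_sim)

# Sample variance 'by hand', using the same formula as s2x and s2y
s2_by_hand <- (sum(yield_sim^2) - n_sim*mean(yield_sim)^2)/(n_sim - 1)

# Sample variance from the built-in function
s2_builtin <- var(yield_sim)

all.equal(s2_by_hand, s2_builtin)
```

all.equal() compares the two values up to a small numerical tolerance, which is the appropriate check here because the two calculations round slightly differently in floating-point arithmetic.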
Now we can find the estimate for the pooled variance, \(s_p^2\), using the sample variances of both groups. This has been saved as sp2.
sp2 <- ((nx - 1)*s2x + (ny - 1)*s2y)/(nx + ny - 2)
Now we can sub all of the values that have been calculated into Equation \(\eqref{eq:unknown-equal}\) as follows.
(xbar - ybar) + c(-1, 1)*t*sqrt(sp2)*sqrt(1/nx + 1/ny)
[1] -0.9695833 1.2769294
We are 95% confident that the difference between the population mean yields of potatoes from fertilised and unfertilised fields lies in the interval \(\left[-0.9696,\, 1.2769\right]\) when the population variances are unknown, but assumed equal.
Method 2: Using the t.test() function
Another way in which we can calculate this 95% confidence interval is by using the t.test() function. This function is already part of R, so you don’t need to load any packages. The only arguments that t.test() needs are,
- x =: this is the vector of normally distributed values from the first group.
- y =: this is the vector of normally distributed values from the second group.
- var.equal =: this takes the value TRUE or FALSE, indicating whether we believe the variances, and hence the standard deviations, of both groups are the same.
- conf.level =: this is the confidence level for the desired interval. It needs to be put in as a decimal, so for a 95% confidence interval, we would use 0.95.
The code below uses t.test() to find the 95% confidence interval for the difference in population mean potato yield from fertilised and unfertilised fields. $conf is used to extract only the confidence interval as by itself, t.test() will return additional results.
t.test(x = potato_fertilised$yield, y = potato_unfertilised$yield,
       var.equal = TRUE, conf.level = 0.95)$conf
[1] -0.9695833 1.2769294
attr(,"conf.level")
[1] 0.95
We see the same result, that the 95% confidence interval for the difference in population mean potato yields from the two types of field is \(\left[-0.9696,\, 1.2769\right]\).
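The agreement between the two methods is not specific to the potato data. As a minimal sketch, the code below repeats both calculations on simulated samples and compares the results. The simulated means, the seed and all object names are assumptions made for the illustration.

```r
set.seed(42)  # arbitrary seed

# Simulated stand-ins for the two yield samples (not the lab data)
potato_f_sim <- rnorm(130, mean = 39, sd = 4)
potato_uf_sim <- rnorm(80, mean = 38, sd = 4)
n_f <- length(potato_f_sim)
n_uf <- length(potato_uf_sim)

# Pooled variance and t quantile, following Method 1
sp2_sim <- ((n_f - 1)*var(potato_f_sim) + (n_uf - 1)*var(potato_uf_sim))/(n_f + n_uf - 2)
t_crit <- qt(0.975, n_f + n_uf - 2)
ci_manual <- (mean(potato_f_sim) - mean(potato_uf_sim)) +
  c(-1, 1)*t_crit*sqrt(sp2_sim)*sqrt(1/n_f + 1/n_uf)

# Same interval from t.test(); as.numeric() drops the conf.level attribute
ci_ttest <- as.numeric(t.test(x = potato_f_sim, y = potato_uf_sim,
                              var.equal = TRUE, conf.level = 0.95)$conf.int)

all.equal(ci_manual, ci_ttest)
```

The full name of the element returned by t.test() is conf.int. Writing $conf works because R partially matches list element names when using $.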
Method 3: Using the tsum.test() function
If we have been given summary statistics for the two groups, but do not know the population variances, we can use the function tsum.test(). This is part of the PASWR2 package, so make sure it is loaded into your R session.
The arguments that tsum.test() can take are,
- mean.x =: this is the mean of the first sample, \(\bar x\).
- mean.y =: this is the mean of the second sample, \(\bar y\).
- s.x =: this is the estimate of the standard deviation of the first sample. We have already found the estimated variance, \(s_X^2\), and saved it as s2x, so we just need to take the square root.
- s.y =: this is the estimate of the standard deviation of the second sample. We have already found the estimated variance, \(s_Y^2\), and saved it as s2y, so we just need to take the square root.
- n.x =: this is the sample size of the first group, \(n_X\).
- n.y =: this is the sample size of the second group, \(n_Y\).
- var.equal =: this takes the value TRUE or FALSE, indicating whether we believe the variances, and hence the standard deviations, of both groups are the same.
- conf.level =: this is the confidence level for the desired interval.
We can input all of these values as arguments in the tsum.test() function to find the 95% confidence interval for the difference in population mean yield of potatoes from fertilised and unfertilised fields. Again, $conf is needed to show only the confidence interval.
library(PASWR2) #this line only needs to be run once in each R session
tsum.test(mean.x = xbar, mean.y = ybar, s.x = sqrt(s2x), s.y = sqrt(s2y),
          n.x = nx, n.y = ny, var.equal = TRUE, conf.level = 0.95)$conf
[1] -0.9695833 1.2769294
attr(,"conf.level")
[1] 0.95
Once again, the 95% confidence interval returned to us for the difference between the population mean potato yield from fertilised and from unfertilised fields is \(\left[-0.9696,\, 1.2769\right]\).
Task
Open the statistical report you have saved from Lab 8. This should already contain complete Introduction and Exploratory Analysis sections.
In the Statistical Analysis section, add in an interpretation of the 95% confidence interval for the difference in population mean potato yields between fertilised and unfertilised fields. Make sure you state the assumptions you have made in calculating this confidence interval, the values in the interval and how the interval should be interpreted.
3.1.4 Variances are unknown and unequal (8.2.6)
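When random samples have been taken from two normal distributions where the variances, \(\sigma_X^2\) and \(\sigma_Y^2\), are unknown and they are assumed unequal, a \((1-\alpha)\cdot100\%\) confidence interval for \(\mu_X-\mu_Y\) is given by,
\begin{equation}
\left(\bar{x}-\bar{y}\right)\pm t_{1-\frac{\alpha}{2};\nu}\sqrt{\frac{s_X^2}{n_X}+\frac{s_Y^2}{n_Y}}
\label{eq:unknown-unequal}
\end{equation}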
\(\nu\) represents the degrees of freedom for the associated \(t\) distribution. The degrees of freedom can be found using, \[\nu=\frac{\left(\frac{s_X^2}{n_X}+\frac{s_Y^2}{n_Y}\right)^2}{\frac{\left(s_X^2/n_X\right)^2}{n_X-1}+\frac{\left(s_Y^2/n_Y\right)^2}{n_Y-1}}\]
where, \(s_X^2=\frac{\sum_{i=1}^{n_X}x_i^2-n_X\bar x^2}{n_X-1}\) and \(s_Y^2=\frac{\sum_{i=1}^{n_Y}y_i^2-n_Y\bar y^2}{n_Y-1}\).
See Section 8.2.6 of Probability and Statistics with R for further examples of finding these confidence intervals.
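The formula for \(\nu\) is easier to read once it is wrapped in a small function. The sketch below uses a hypothetical helper, welch_df() (not a function from base R or PASWR2), and applies it to the rounded wheat sample variances reported in Method 1 of this section, \(s_X^2=8.21\) and \(s_Y^2=1.13\), with \(n_X=90\) and \(n_Y=120\).

```r
# Hypothetical helper: degrees of freedom for the unequal-variance interval
welch_df <- function(s2_x, s2_y, n_x, n_y) {
  a <- s2_x/n_x  # estimated variance of the first sample mean
  b <- s2_y/n_y  # estimated variance of the second sample mean
  (a + b)^2 / (a^2/(n_x - 1) + b^2/(n_y - 1))
}

# Rounded wheat sample variances and the wheat sample sizes
nu_wheat <- welch_df(s2_x = 8.21, s2_y = 1.13, n_x = 90, n_y = 120)
nu_wheat

# For comparison, the pooled degrees of freedom would be n_X + n_Y - 2
90 + 120 - 2
```

With these rounded inputs \(\nu\) comes out at approximately 107, well below the 208 that the pooled approach would use. The value is generally not a whole number, and qt() accepts non-integer degrees of freedom.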
The standard deviations of the distributions that wheat yields follow are not actually known and from the exploratory analysis, it seems that the population variances are likely unequal. Assuming this is the case, we can still calculate a 95% confidence interval for the difference in population mean wheat yield between fertilised and unfertilised fields using Equation \(\eqref{eq:unknown-unequal}\) above.
In order to use this result, both groups must be normally distributed. We have already seen from the exploratory analysis that this is the case. If we had not, we would need to check that both groups follow a normal distribution, for example using QQ plots.
There are a few different ways in which we can calculate the desired 95% confidence interval.
Method 1: Using R as a calculator
We start by saving \(n_X=90\) and \(n_Y=120\) as objects in the Environment tab and calculating the mean of each group.
nx <- 90
ny <- 120
xbar <- mean(wheat_fertilised$yield)
ybar <- mean(wheat_unfertilised$yield)
We can see now, in the Environment tab, that \(\bar x=23.8\) and \(\bar y=17.98\).
Before we can find the \(\left(1-\frac{\alpha}{2}\right)\)th quantile from the \(t\)-distribution, we need to find the associated degrees of freedom \(\nu\). This involves a few steps. First of all, we find \(\sum_{i=1}^{n_X}x_i^2\) and \(\sum_{i=1}^{n_Y}y_i^2\).
sum_x2 <- sum(wheat_fertilised$yield^2)
sum_y2 <- sum(wheat_unfertilised$yield^2)
Running the above code will show us that \(\sum_{i=1}^{n_X}x_i^2=5.171523\times 10^{4}\) and \(\sum_{i=1}^{n_Y}y_i^2=3.893489\times 10^{4}\). We can use these values to find the variance of the two wheat yield samples, \(s_X^2\) and \(s_Y^2\).
s2x <- (sum_x2 - nx*xbar^2)/(nx - 1)
s2y <- (sum_y2 - ny*ybar^2)/(ny - 1)
If we had used the var() function we would be left with the same result, that \(s_X^2=8.21\) and \(s_Y^2=1.13\). Now we can use these values to find \(\nu\) as follows (this value is saved as df) and find the \(\left(1-\frac{\alpha}{2}\right)\)th quantile from the \(t\)-distribution with \(\nu\) degrees of freedom using the function qt().
df <- (s2x/nx + s2y/ny)^2 / ((s2x/nx)^2/(nx-1) + (s2y/ny)^2/(ny - 1))
t <- qt(0.975, df)
Finally, we can sub all of these values into the formula for the confidence interval.
(xbar - ybar) + c(-1, 1)*t*sqrt(s2x/nx + s2y/ny)
[1] 5.190980 6.448426
We would interpret this interval by saying that we are 95% confident that the difference between the population mean wheat yield from fertilised and unfertilised fields lies in the interval \(\left[5.1910,\, 6.4484\right]\). This interval applies in the scenario where the variance of each underlying distribution is unknown and is assumed to differ between the two groups.
Method 2: Using the t.test() function
We can find the same interval using the t.test() function.
To find the 95% confidence interval for the difference in population mean wheat yields from fertilised and unfertilised fields, assuming the variances are unequal, we would use the code below.
t.test(x = wheat_fertilised$yield, y = wheat_unfertilised$yield,
       var.equal = FALSE, conf.level = 0.95)$conf
[1] 5.190980 6.448426
attr(,"conf.level")
[1] 0.95
This again tells us that the interval is \(\left[5.1910,\, 6.4484\right]\), so we are 95% confident the difference in population mean wheat yields from fertilised and unfertilised fields lies in this interval.
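As a check that t.test() with var.equal = FALSE carries out the same steps as Method 1, the sketch below compares both the interval and the degrees of freedom on simulated samples. The simulated means and standard deviations, the seed and the object names are assumptions made for the illustration and are not the lab data.

```r
set.seed(7)  # arbitrary seed

# Simulated stand-ins for the two wheat samples (not the lab data)
wheat_f_sim <- rnorm(90, mean = 24, sd = 3)
wheat_uf_sim <- rnorm(120, mean = 18, sd = 1)
n_wf <- length(wheat_f_sim)
n_wuf <- length(wheat_uf_sim)

# Estimated variances of the two sample means
v_f <- var(wheat_f_sim)/n_wf
v_uf <- var(wheat_uf_sim)/n_wuf

# Degrees of freedom and interval, following Method 1
df_sim <- (v_f + v_uf)^2 / (v_f^2/(n_wf - 1) + v_uf^2/(n_wuf - 1))
ci_manual_w <- (mean(wheat_f_sim) - mean(wheat_uf_sim)) +
  c(-1, 1)*qt(0.975, df_sim)*sqrt(v_f + v_uf)

# Same quantities from t.test() with unequal variances
welch_fit <- t.test(x = wheat_f_sim, y = wheat_uf_sim,
                    var.equal = FALSE, conf.level = 0.95)

all.equal(ci_manual_w, as.numeric(welch_fit$conf.int))
all.equal(df_sim, unname(welch_fit$parameter))
```

The parameter element of the t.test() result holds the degrees of freedom it used, so it can be compared directly with the value calculated by hand.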
Method 3: Using the tsum.test() function
We could also use the summary statistics, already calculated, and the function tsum.test() from the PASWR2 package. This would look as follows.
library(PASWR2) #this line only needs to be run once in each R session
tsum.test(mean.x = xbar, mean.y = ybar, s.x = sqrt(s2x), s.y = sqrt(s2y),
          n.x = nx, n.y = ny, var.equal = FALSE, conf.level = 0.95)$conf
[1] 5.190980 6.448426
attr(,"conf.level")
[1] 0.95
We would say that we are 95% confident that the difference in population mean yields of wheat from the two types of field lies in the interval \(\left[5.1910,\, 6.4484\right]\).
Task
In the Statistical Analysis section of your report, add in an interpretation of the 95% confidence interval for the difference in population mean wheat yields between fertilised and unfertilised fields.
Make sure you state the assumptions you have made in calculating this confidence interval, the values in the interval and how the interval should be interpreted.